ABSTRACT

In this paper we study the problem of how resilient networks are
to node faults. Specifically, we investigate the question of how
many faults a network can sustain so that it still contains a large
(i.e. linear-sized) connected component that still has approximately
the same expansion as the original fault-free network. For this we
apply a pruning technique which culls away parts of the faulty
network which have poor expansion. This technique can be applied to
both adversarial faults and to random faults. For adversarial faults
we prove that for every network with expansion $\alpha$, a large
connected component with basically the same expansion as the
original network exists for up to a constant times $\alpha \cdot n$
faults. This result is tight in the sense that every graph $G$ of
size $n$ and uniform expansion $\alpha(\cdot)$, i.e. $G$ has an
expansion of $\alpha(n)$ and every subgraph $G'$ of size $m$ of $G$
has an expansion of $O(\alpha(m))$, can be broken into sublinear
components with $\omega(\alpha(n) \cdot n)$ faults.

For random faults we observe that the situation is significantly
different, because in this case the expansion of a graph only gives
a very weak bound on its resilience to random faults. More
specifically, there are networks of uniform expansion
$O(1/\sqrt{n})$ that are resilient against a constant fault
probability but there are also networks of uniform expansion
$\Omega(1/\log n)$ that are not resilient against a $O(1/\log n)$
fault probability. Thus, a different parameter is needed. For this
we introduce the span of a graph which allows us to determine the
maximum fault probability in a much better way than the expansion
can. We use the span to show the first known results for the effect
of random faults on the expansion of $d$-dimensional meshes.
1. INTRODUCTION

Communication in faulty networks is a classical field in network 
theory. In practice, one cannot expect nodes or communication 
links to work without complications. Software or hardware faults 
(or phenomena outside the control of a network operator such as 
caterpillars) may cause nodes or links to go down. To be able to 
adapt to faults without a serious degradation of the service, net- 
works and routing protocols have to be set up so that they are fault- 
tolerant. Fault-tolerant routing has recently attained renewed in- 
terest due to the tremendous rise in popularity of mobile ad-hoc 
networks and peer-to-peer networks. In these networks, faults are 
actually not an exception but a frequently occurring event: in mo- 
bile ad-hoc networks, users may run out of battery power or may 
move out of reach of others, and in peer-to-peer networks, users 
may leave without notice. 

Central questions in the theoretical area of faulty networks have 
been: 

• How many faults can a network sustain so that the size of its 
largest connected component is still a constant fraction of the 
original size? 

• How many faults can a network sustain so that it can still 
emulate its ideal counterpart with constant slowdown? 

The first question has been heavily studied in the graph theory 
community, and the second question has been investigated mostly 
by the parallel computing community to find out up to which point 
a faulty parallel computer can still emulate an ideal parallel com- 
puter with the same topology with constant slowdown. We refer 
the reader to [27] for a survey of results in these areas.



1.1 Large connected components in faulty networks

We start with an overview of previous results for random faults 
and afterwards consider adversarial faults. 

Given a graph $G$ and a probability value $p$, let $G^{(p)}$ be the
random graph obtained from $G$ by keeping each edge of $G$ alive with
probability $p$ (i.e. $p$ is the survival probability). Given a
graph $G$, let $\gamma(G) \in [0, 1]$ be the fraction of nodes of $G$
contained in a largest connected component.
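
The two definitions above can be tried out numerically. The following is a minimal sketch, not part of the paper: the function names `percolate` and `largest_component_fraction` are hypothetical, nodes are assumed to be numbered 0..n-1, and a union-find structure is used to measure $\gamma$.

```python
import random


def largest_component_fraction(n, edges):
    """gamma(G): fraction of the n nodes lying in a largest connected component."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Path-halving find for the union-find forest.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


def percolate(edges, p, rng):
    """Sample G^(p): keep each edge independently with survival probability p."""
    return [e for e in edges if rng.random() < p]


if __name__ == "__main__":
    n = 200
    complete = [(u, v) for u in range(n) for v in range(u + 1, n)]
    rng = random.Random(1)
    for factor in (0.5, 2.0):
        p = factor / (n - 1)  # below and above p* = 1/(n - 1) for the complete graph
        print(factor, largest_component_fraction(n, percolate(complete, p, rng)))
```

Running the sketch for several seeds gives an empirical feel for the threshold behaviour; it is an illustration only and proves nothing about the limit $n \to \infty$.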

Let $\mathcal{G} = \{G_n \mid n \in \mathbb{N}\}$ be any family of
graphs with parameter $n$. Let $p^*$ be the critical probability for
the existence of a linear-sized connected component, i.e. for every
constant $\epsilon > 0$ it holds:

1. For every $p > (1 + \epsilon) p^*$ there exists a constant $c > 0$ with

   $\lim_{n \to \infty} \Pr[\gamma(G_n^{(p)}) > c] = 1$.

2. For all constants $c > 0$ and for all $p < (1 - \epsilon) p^*$ it holds that

   $\lim_{n \to \infty} \Pr[\gamma(G_n^{(p)}) > c] = 0$.

Of course, it is not obvious whether critical probabilities exist.
However, the results by Erdős and Rényi [10] and their subsequent
improvements (e.g. [5, 21]) imply that for the complete graph on $n$
nodes, $p^* = 1/(n - 1)$, and that for a random graph with
$d \cdot n/2$ edges, $p^* = 1/d$. For the 2-dimensional $n \times n$
mesh, Kesten showed that $p^* = 1/2$ [16]. Ajtai, Komlós and
Szemerédi proved that for the hypercube of dimension $n$,
$p^* = 1/n$. For the $n$-dimensional butterfly network, Karlin,
Nelson and Tamaki showed that $0.337 < p^* < 0.436$ [15]. Leighton
and Maggs [17] showed that there is an indirect constant-degree
network connecting $n$ inputs with $n$ outputs via $\log n$ levels of
$n$ nodes each, called multibutterfly, that has the following
property: up to a constant fault probability it is still possible to
find $O(\log n)$ length paths from a constant fraction of the inputs
to a constant fraction of the outputs. Subsequently Cole, Maggs and
Sitaraman extended this result for the butterfly.

Adversarial fault models have also been investigated. Leighton
and Maggs [17] also showed that no matter how an adversary chooses
$f$ nodes to fail, there will be a connected component left in the
multibutterfly with at least $n - O(f)$ inputs and at least
$n - O(f)$ outputs. (In fact, one can even still route packets
between the inputs and outputs in this component in almost the same
number of time steps as in the ideal case.) Subsequently Leighton,
Maggs and Sitaraman [19] extended this result for the butterfly.

Upfal [28], following up on work by Dwork et al. [9] and Alon
and Chung [2], showed that there is also a direct constant-degree
network on $n$ nodes, a so-called expander, that has the property: no
matter how an adversary chooses $f$ nodes to fail, there will be a
connected component left in it with at least $n - O(f)$ nodes. Both
results are optimal up to constants. Upfal uses a pruning technique
to achieve his bound which is similar in spirit to the one we use.
Apart from the fact that Upfal gives a polynomial-time algorithm
for pruning while we do not, the important difference worth noting
is that Upfal's pruning does not guarantee a large component of
good expansion. In fact, to the best of our knowledge there is no
known constant approximation algorithm to determine the expansion
of a graph of unknown topology.

1.2 Simulation of fault-free networks by faulty 
networks 

Next we look at the problem of simulating fault-free networks
by faulty networks. Consider the situation that there can be up to
$f$ worst-case node faults in the system at any time. One way to
check whether the largest remaining component still allows
efficient communication is to check whether it is possible to embed
into the largest connected component of a faulty network a
fault-free network of the same size and kind. An embedding of a graph
$G$ into a graph $H$ maps the nodes of $G$ to non-faulty nodes of $H$
and the edges of $G$ to non-faulty paths in $H$. An embedding is
called static if the mapping of the nodes and edges is fixed. Both
static and dynamic embeddings have been used. A good embedding is
one with minimum load, congestion, and dilation, where the load
of an embedding is the maximum number of nodes of $G$ that are
mapped to any single node of $H$, the congestion of an embedding
is the maximum number of paths that pass through any edge $e$ of $H$,
and the dilation of an embedding is the length of the longest path.
The load, congestion, and dilation of the embedding determine the
time required to emulate each step of $G$ on $H$. In fact, Leighton,
Maggs, and Rao have shown [18] that if there is an embedding of
$G$ into $H$ with load $\ell$, congestion $c$, and dilation $d$, then
$H$ can emulate any communication step (and also computation step)
on $G$ with slowdown $O(\ell + c + d)$.
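
To make the three embedding parameters concrete, here is a small sketch that computes them for a given static embedding. The representation is an assumption made for illustration: `node_map` sends each node of $G$ to a node of $H$, and `path_map` sends each edge of $G$ to a path in $H$ given as a list of host nodes. The function name `embedding_costs` is hypothetical.

```python
from collections import Counter


def embedding_costs(node_map, path_map):
    """Return (load, congestion, dilation) of a static embedding of G into H."""
    # Load: the largest number of guest nodes mapped to one host node.
    load = max(Counter(node_map.values()).values())

    edge_use = Counter()
    dilation = 0
    for path in path_map.values():
        hops = list(zip(path, path[1:]))
        # Dilation: the length (in host edges) of the longest path.
        dilation = max(dilation, len(hops))
        for a, b in hops:
            edge_use[frozenset((a, b))] += 1  # host edges are undirected

    # Congestion: the largest number of paths crossing one host edge.
    congestion = max(edge_use.values(), default=0)
    return load, congestion, dilation


if __name__ == "__main__":
    node_map = {"a": 0, "b": 2, "c": 2}
    path_map = {("a", "b"): [0, 1, 2], ("a", "c"): [0, 1, 2], ("b", "c"): [2]}
    print(embedding_costs(node_map, path_map))
```

By the result of Leighton, Maggs, and Rao quoted above, the sum of the three returned values bounds the emulation slowdown up to a constant factor.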

When demanding a constant slowdown, only a few results are
known so far. In the case of worst-case faults, it was shown by
Leighton, Maggs and Sitaraman (using dynamic embedding strategies)
that an $n$-input butterfly with $n^{1-\epsilon}$ worst-case faults
(for any constant $\epsilon$) can still emulate a fault-free
butterfly of the same size with only constant slowdown [19].
Furthermore, Cole, Maggs and Sitaraman showed that an $n \times n$
mesh can sustain up to $n^{1-\epsilon}$ worst-case faults and still
emulate a fault-free mesh of the same size with (amortized) constant
slowdown. It seems that also the $n$-node hypercube can even achieve
a constant slowdown for $n^{1-\epsilon}$ worst-case faults, but so
far only partial answers have been obtained [19].

Random faults have also been studied. For example, Håstad,
Leighton and Newman [12] showed that if each edge of the hypercube
fails independently with any constant probability $p < 1$, then
the functioning parts of the hypercube can be reconfigured to
simulate the original hypercube with constant slowdown. Leighton,
Maggs and Sitaraman [19] showed that a butterfly network whose
nodes fail with some constant probability $p$ can still emulate a
fault-free butterfly of the same size with slowdown
$2^{O(\log^* n)}$. Interestingly, in the conference version of [7],
Cole, Maggs and Sitaraman claim that an $n \times n$ mesh in which
each node is faulty independently with a constant fault probability
is able to emulate a fault-free mesh with a constant slowdown [8].
The proof of this claim, which is stronger than the theorem we prove
about the $n \times n$ mesh in this paper, is omitted in the
conference version and has not appeared elsewhere to the best of our
knowledge.

For a list of further references concerning embeddings of fault- 
free into faulty networks see the paper by Leighton, Maggs and 
Sitaraman [19].

1.3 Our approach 

The two common approaches - connectivity and emulation of 
fault-free by faulty networks - are too extreme for many practical 
applications. Knowing how long a network is still connected may 
not be very useful, because in extreme cases (just a single line con- 
nects one half to the other) the speed of communication may be 
reduced to a crawl, making it useless for applications that need a 
fast interaction or a large bandwidth such as interactive gaming or 
video conferences. On the other hand, emulating a fault-free
network on a faulty network is like using a giant hammer to crack a
lesser nut, so to speak. Emulation may not be needed when all we
want is reduced congestion or good expansion.

Applications in ad-hoc networks or peer-to-peer systems usually 
do not care about how a network is connected, concerning them- 
selves instead with whether it still provides sufficient bandwidth 
and ensures sufficiently small delays. In this scenario a more
relevant question is:

How many faults can a network sustain so that it still 
contains a network of at least a constant fraction of 
its original size that still has approximately the same 
expansion ? 

Knowing an answer to this question would have many useful
consequences for distributed data management, routing, and
distributed computing. Research on load balancing has shown that if
the expansion basically stays the same, the ability of a network to
balance single-commodity or multi-commodity load basically stays
the same, and this ability can be exploited through simple local
algorithms [11]. Also, the ability of a network to route information
is preserved because it is closely related to its expansion [26].
Furthermore, as long as the original network still has a large
connected component of almost the same expansion, one can still
achieve almost everywhere agreement which is an important
prerequisite for fundamental primitives such as atomic broadcast,
Byzantine agreement, and clock synchronization [4, 9, 28].

Many different fault models have been studied in the literature: 
faults may be permanent or transient, nodes and/or edges may break 
down, and faults may happen at random or may be caused by an ad- 
versary or attacker. The former faults are called random faults, and 
the latter faults are called adversarial faults. We will concentrate 
on situations in which there are static node faults, i.e. nodes either 
break down randomly or due to some adversary. For adversarial 
faults, we will consider the node expansion of a graph, and for ran- 
dom faults we will use the edge expansion of a graph. 

Given a graph $G = (V, E)$ and a subset $U \subseteq V$, the (node)
expansion of $U$ is defined as

    $\alpha(U) = \frac{|\Gamma(U)|}{|U|}$

where $\Gamma(U)$ is the set of nodes in $V \setminus U$ that have an
edge from $U$ and $|S|$ denotes the size of set $S$. The (node)
expansion of $G$ is defined as
$\alpha = \min_{U : |U| \le |V|/2} \alpha(U)$.

Similarly, the edge expansion of $G$ is defined as:

    $\alpha_e = \min_{U : |U| \le |V|/2} \frac{|(U, V \setminus U)|}{|U|}$

where $(U, V \setminus U)$ denotes the set of edges with one endpoint
in $U$ and the other in $V \setminus U$.
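
For very small graphs both quantities can be evaluated by exhaustive search over all node sets of size at most $|V|/2$. The sketch below is an illustration under that assumption (it takes exponential time); the function names are hypothetical and the graph is given as a node list and an undirected edge list.

```python
from itertools import combinations


def _adjacency(nodes, edges):
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def node_expansion(nodes, edges):
    """min over U with |U| <= |V|/2 of |Gamma(U)| / |U| (brute force)."""
    nodes = list(nodes)
    adj = _adjacency(nodes, edges)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            u_set = set(subset)
            # Gamma(U): nodes outside U that have a neighbour in U.
            boundary = set().union(*(adj[u] for u in u_set)) - u_set
            best = min(best, len(boundary) / len(u_set))
    return best


def edge_expansion(nodes, edges):
    """min over U with |U| <= |V|/2 of |(U, V \\ U)| / |U| (brute force)."""
    nodes = list(nodes)
    best = float("inf")
    for size in range(1, len(nodes) // 2 + 1):
        for subset in combinations(nodes, size):
            u_set = set(subset)
            # Count edges with exactly one endpoint in U.
            cut = sum(1 for a, b in edges if (a in u_set) != (b in u_set))
            best = min(best, cut / len(u_set))
    return best


if __name__ == "__main__":
    path = [(0, 1), (1, 2), (2, 3)]
    print(node_expansion(range(4), path), edge_expansion(range(4), path))
```

On the 4-node path the minimum is attained by cutting the path in the middle, which is the kind of bottleneck the expansion is meant to capture.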

1.4 Our main results 

Adversarial faults 

We give general upper and lower bounds for the number of node
faults a graph can sustain so that it still has a large component with
basically the same expansion, where the bounds are tight up to a
constant factor. More specifically, we show that the number of
adversarial node faults a graph with node expansion $\alpha$ and $n$
nodes can sustain, with only a constant factor decrease in its
expansion, is a constant times $\alpha \cdot n$. For graphs $G$ of
size $n$ and uniform expansion $\alpha(\cdot)$, i.e. $G$ has an
expansion of $\alpha(n)$ and every subgraph $G'$ of size $m$ of $G$
has an expansion of $O(\alpha(m))$, this result is best possible up
to constant factors.

Random faults 

We also study random faults. Our main contribution here is to sug- 
gest a new parameter for their study, which may be of independent 
interest. 



Consider a graph $G = (V, E)$. Let $U \subseteq V$ be any subset of
nodes. $U$ is defined to be compact if and only if $U$ and
$V \setminus U$ are connected in $G$. Let $\mathcal{U}$ be the set of
all compact sets of $G$. Let $P(U)$ be the smallest tree in $G$ which
connects every node in $\Gamma(U)$ (i.e. it essentially spans the
boundary of $U$). Note that the set of nodes in $P(U)$ need not be
from $U$ alone or from $V \setminus U$ alone. Then the span of a
graph is defined as:

    $\sigma = \max_{U \in \mathcal{U}} \left\{ \frac{|P(U)|}{|\Gamma(U)|} \right\}$    (1)



The span helps us characterize the resilience of the expansion to
random faults. We show that a graph with maximum degree $\delta$ and
span $\sigma$ can tolerate a fault probability up to a constant times and
still retain an expansion within a factor of $\delta$ of its original
expansion.

We also show that the d-dimensional meshes have constant span. 
The proof of this theorem is of independent value as it establishes 
an interesting property of the d-dimensional mesh: The boundary 
of any set of connected vertices in the d-dimensional mesh, whose 
complement is also connected, can be spanned by a tree of size at 
most twice the size of the boundary. 

1.5 Outline of the paper 

The rest of the paper is organized as follows: In Section 2 we
consider adversarial faults, and in Section 3 we consider random
faults. The paper ends in Section 4 with a discussion of how our
results are related to previous research and some open problems.

2. ADVERSARIAL FAULTS 

In this section we prove the existence of a large connected
component with good expansion in a graph with faulty nodes. We
assume that a malicious adversary decides which nodes are faulty.
More formally, we are given a network $G = (V, E)$ with $n$ nodes
and vertex expansion $\alpha$. An adversary gives us a faulty version
of this network, called $G_f$, with $f$ faulty nodes removed. We will
show that there exists a subnetwork of $G_f$ called $H$ which has
$\Theta(n)$ nodes and has an expansion of $\Theta(\alpha)$ provided
that the adversary is given no more than $O(\alpha \cdot n)$ faults.

We cannot argue that the expansion of $G_f$ is no more than a
constant factor less than $\alpha$ for the simple reason that the
adversary can create bottlenecks in the network. However, we describe
a way to find a large connected component of $G_f$ with the required
properties using an algorithm called Prune described in Figure 1.
Note that the running time of Prune is not necessarily polynomial,
nor are we claiming it is. Prune simply helps us prove an existential
result.

Before we get to the algorithm we need to introduce some
notation. We define $\Gamma(S)$ to be the set of nodes in the
neighbourhood of a subnetwork $S$. The algorithm generates a sequence
of graphs $G_0$ to $G_m$. We now present the algorithm and state the
main theorem of this subsection.

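Theorem 2.1. Given a network $G$ with $n$ nodes, node expansion
$\alpha$ and $f$ faulty nodes chosen by an adversary, for any constant
$k$ such that $k \ge 2$ and $\frac{k \cdot f}{\alpha} \le \frac{n}{4}$,
Prune$(1 - \frac{1}{k})$ returns a subnetwork $H$ of at least size
$n - \frac{k \cdot f}{\alpha}$ with expansion
$(1 - \frac{1}{k}) \cdot \alpha$.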

Proof. Denote $G_f \setminus H$ as $S$. $S$ is thus the union of all
the regions culled by Prune. To prove the result we will first show
that the size of $S$ is bounded by $\frac{k \cdot f}{\alpha}$. To show
this we will use the fact that the number of faults required to cull
a region is proportional to the size of the region. To demonstrate
that we need the following lemma.



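Algorithm Prune($\epsilon$)

1: $G_0 \leftarrow G_f$; $i \leftarrow 0$
2: while $\exists S_i \subseteq G_i$ such that $|\Gamma(S_i)| \le \alpha \cdot \epsilon \cdot |S_i|$ and $|S_i| \le |G_i|/2$
3:     $G_{i+1} \leftarrow G_i \setminus S_i$
4:     $i \leftarrow i + 1$
5: end while
6: $H \leftarrow G_i$; $m \leftarrow i$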



Figure 1: The pruning algorithm 
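
The pruning loop of Figure 1 can be mimicked on toy instances by exhaustive search. The sketch below is only an illustration of the existential procedure, not an efficient algorithm: it takes exponential time, the function name `prune` and its argument layout are assumptions, and the expansion `alpha` of the fault-free graph is passed in as a parameter rather than computed.

```python
from itertools import combinations


def prune(nodes, edges, faulty, alpha, eps):
    """Brute-force sketch of Prune(eps) on the faulty graph G_f = G minus faulty."""
    alive = set(nodes) - set(faulty)
    adj = {v: set() for v in alive}
    for u, v in edges:
        if u in alive and v in alive:
            adj[u].add(v)
            adj[v].add(u)

    def find_sparse_set(current):
        # Look for S in G_i with |Gamma(S)| <= alpha * eps * |S| and |S| <= |G_i| / 2.
        for size in range(1, len(current) // 2 + 1):
            for subset in combinations(sorted(current), size):
                s_set = set(subset)
                neigh = set().union(*(adj[u] for u in s_set))
                boundary = (neigh & current) - s_set
                if len(boundary) <= alpha * eps * len(s_set):
                    return s_set
        return None

    current = set(alive)  # G_0 = G_f
    while True:
        culled = find_sparse_set(current)
        if culled is None:
            return current  # H = G_m
        current -= culled  # G_{i+1} = G_i \ S_i


if __name__ == "__main__":
    clique = [(u, v) for u in range(5) for v in range(u + 1, 5)]
    edges = clique + [(4, 5), (5, 6)]
    # Node 4 fails, which disconnects the tail {5, 6}; Prune culls it.
    print(prune(range(7), edges, {4}, alpha=0.5, eps=0.5))
```

In the usage example the failed node 4 cuts the tail {5, 6} off from the clique, so the tail has an empty neighbourhood in $G_f$ and is culled, leaving the clique nodes {0, 1, 2, 3}.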



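Lemma 2.2.

    $\left| \Gamma\left( \bigcup_{0 \le i \le j} S_i \right) \right| \le \sum_{0 \le i \le j} |\Gamma(S_i)| \le \alpha \cdot \left(1 - \frac{1}{k}\right) \cdot \left| \bigcup_{0 \le i \le j} S_i \right|$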

Proof. Consider the first inequality. Obviously, any node that
lies in the neighborhood of $\bigcup_i S_i$ must lie in the
neighborhood of some $S_i$. Therefore
$\Gamma(\bigcup_i S_i) \subseteq \bigcup_i \Gamma(S_i)$. Hence the
first inequality. Each set $S_i$ that is culled by
Prune$(1 - \frac{1}{k})$ has the property that
$|\Gamma(S_i)| \le \alpha \cdot (1 - \frac{1}{k}) \cdot |S_i|$. Since
the sets $S_i$ are disjoint, $\sum_i |S_i| = |\bigcup_i S_i|$. Hence
the second inequality. □

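We will show that $|S| \le \frac{k \cdot f}{\alpha}$ by contradiction.
Let, if possible, $|S| > \frac{k \cdot f}{\alpha}$. Since at every
iteration of the algorithm we pick an $S_i$ which is the smaller side
of the cut we have found, each $S_i$ is at most $n/2$ in size. Now,
since $\frac{k \cdot f}{\alpha} < |S|$, there is a $j$ such that either
$\frac{k \cdot f}{\alpha} < |\bigcup_{0 \le i \le j} S_i| \le n/2$ or
$S_j$ such that $\frac{k \cdot f}{\alpha} < |S_j| \le n/2$. So we can
always choose an $S' \subseteq S$ such that
$\frac{k \cdot f}{\alpha} < |S'| \le n/2$. In either case, from
Lemma 2.2 we have:

    $|\Gamma(S')| \le \alpha \cdot \left(1 - \frac{1}{k}\right) \cdot |S'|$

We know that in $G$, $|\Gamma(S')|$ is at least $\alpha \cdot |S'|$.
Hence, the number of faulty nodes in $S'$'s neighborhood must be at
least $\alpha (1 - (1 - \frac{1}{k})) \cdot |S'|$, i.e. greater than
$\alpha \cdot \frac{1}{k} \cdot \frac{k \cdot f}{\alpha}$, i.e. greater
than $f$. Since the total number of faults allowed to the adversary
is at most this number, we have a contradiction. Hence, $H$ is at
least $n - \frac{k \cdot f}{\alpha}$ in size and has expansion at
least $(1 - \frac{1}{k}) \cdot \alpha$. □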



The result given in Theorem 2.1 is the best possible up to
constant factors. To prove this we will first show that for every
$\alpha > 0$ smaller than some constant there is an infinite family of
graphs which disintegrate into sublinear components on removing some
$c \cdot \alpha \cdot n$ vertices where $n$ is the number of nodes in
the given graph and $c$ is some constant. Then we show that
Theorem 2.1 is also the best possible up to constant factors for
arbitrary graphs of uniform expansion.

Theorem 2.3. There exists a constant $\beta$ such that, given any
$\alpha < \beta$, there is an infinite family of graphs with expansion
$\alpha$ for which there is an adversarial selection of
$c \cdot \alpha \cdot n$ faulty nodes causing the graph to break into
sublinear components, where $n$ is the number of nodes in the graph
and $c$ is an appropriately chosen constant.



Proof. To construct this family of graphs let us consider
$\mathcal{G}(n)$ to be an infinite family of expander graphs with
constant expansion $\beta$ and constant degree $\delta$.

For each $G \in \mathcal{G}(n)$, construct a graph, $H$, which is a
copy of $G$ with each edge replaced by a chain of $k$ nodes, where $k$
is even. Then $H$ has $\frac{\delta \cdot k \cdot n}{2} + n = O(k \cdot n)$
nodes.

Claim 2.4. Graph $H$ has expansion $\Theta(\frac{1}{k})$.

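Proof. Take any subset $U$ of nodes in $H$ representing original
nodes in $G$ and let $U'$ be the set resulting from $U$ by adding the
$k/2$ nearest nodes of each chain a node in $U$ is connected to. Then
$|U'| = (\frac{\delta \cdot k}{2} + 1) \cdot |U|$ but
$|\Gamma(U')| = |\Gamma(U)| \le \delta \cdot |U|$. Hence,

    $\alpha(U') = \frac{|\Gamma(U')|}{|U'|} \le \frac{\delta \cdot |U|}{|U'|} = \frac{\delta}{\delta k/2 + 1} < \frac{2}{k}$

completing the proof of the claim. □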

Now, from each chain of $k$ nodes we remove the central node.
Each component remaining has roughly $\delta \cdot \frac{k}{2}$ nodes
left, i.e. a sublinear number, and the total number of nodes removed
is $\frac{\delta}{2} \cdot n$, which is $\Theta(\frac{1}{k})$ times the
number of nodes in the graph. □
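
The construction in this proof is easy to replay on a small base graph. The sketch below is an illustration only: the helper names `subdivide` and `component_sizes` are hypothetical, the base graph is given as an edge list over nodes 0..n-1, and for even $k$ the "central" node of a chain is taken to be the one at index $k/2$, which is one of the two middle positions.

```python
def subdivide(n, edges, k):
    """Replace each edge (u, v) of G by a chain of k new nodes.

    Returns (number of nodes of H, edge list of H, list of central chain nodes).
    """
    new_edges, centers = [], []
    next_id = n
    for u, v in edges:
        chain = list(range(next_id, next_id + k))
        next_id += k
        path = [u] + chain + [v]
        new_edges.extend(zip(path, path[1:]))
        centers.append(chain[k // 2])
    return next_id, new_edges, centers


def component_sizes(num_nodes, edges, removed):
    """Sizes of the connected components left after deleting `removed`."""
    removed = set(removed)
    adj = {v: [] for v in range(num_nodes) if v not in removed}
    for a, b in edges:
        if a in adj and b in adj:
            adj[a].append(b)
            adj[b].append(a)
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, count = [start], 0
        while stack:
            x = stack.pop()
            count += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        sizes.append(count)
    return sizes


if __name__ == "__main__":
    k4 = [(u, v) for u in range(4) for v in range(u + 1, 4)]  # degree delta = 3
    num_nodes, h_edges, centers = subdivide(4, k4, k=4)
    print(num_nodes, sorted(component_sizes(num_nodes, h_edges, centers)))
```

After deleting one node per chain, every surviving component contains exactly one original node together with the chain halves attached to it, so its size is bounded in terms of $\delta$ and $k$ only.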

Recall that a graph $G$ of size $n$ is of uniform expansion
$\alpha(\cdot)$ if the expansion of $G$ is $\alpha(n)$ and every
subgraph $G'$ of size $m$ of $G$ has an expansion of $O(\alpha(m))$.
This is the case for all well-known classes of graphs. Consider, for
example, the $m \times m$ mesh with $n = m^2$ nodes and let
$\alpha(m) = 1/\sqrt{m}$. Its expansion is approximately
$1/\sqrt{n}$, and every subgraph of that mesh of size $m$ has an
expansion of $O(1/\sqrt{m})$. Hence, it has a uniform expansion.

Theorem 2.5. For every connected graph of size $n$ and uniform
expansion $\alpha(x)$ there is an adversarial selection of
$\omega(\alpha(n) \cdot n)$ faulty nodes that causes the graph to
break into sublinear components.

Proof. Let $G = (V, E)$ be any graph of uniform expansion
$\alpha(x)$ that consists of $n$ nodes. Then there must be a set
$U_1 \subseteq V$, $|U_1| \le n/2$, so that
$|\Gamma(U_1)| \le \alpha(n) \cdot |U_1|$. Removing $\Gamma(U_1)$
leaves $G$ with a set $\mathcal{V}_1 = \{V', V''\}$ of two node sets,
$V' = U_1$ and $V'' = V \setminus (U_1 \cup \Gamma(U_1))$. Let $V_1$
be a set in $\mathcal{V}_1$ of maximum size. It follows from the
uniformity of $G$ that there must be a set $U_2 \subseteq V_1$,
$|U_2| \le |V_1|/2$, so that $|\Gamma(U_2)|$ w.r.t. $G(V_1)$ is
$O(\alpha(|V_1|)) \cdot |U_2|$. Removing $\Gamma(U_2)$ results in a
new set $\mathcal{V}_2$ of sets of nodes in which $V_1$ is replaced by
$U_2$ and $V_1 \setminus (U_2 \cup \Gamma(U_2))$. We continue to take
a node set $V_i$ of largest size out of $\mathcal{V}_i$ and remove
nodes at the minimum expansion part in $G(V_i)$ until there is no
subset in $\mathcal{V}_i$ left of size at least $\epsilon n$.

Our goal is to show that this process only removes
$O(\frac{\log(1/\epsilon)}{\epsilon} \cdot \alpha(n) \cdot n)$ nodes
from $G$. If this is true, the theorem would follow immediately. We
prove the bound with a charging strategy: Each time a set $V_i$ is
selected from $\mathcal{V}_i$, we charge all nodes in
$\Gamma(U_{i+1})$ taken away from $V_i$ to the nodes in $U_{i+1}$.
Since, with $m = |V_i| \ge \epsilon n$,

    $|\Gamma(U_{i+1})| = O(\alpha(m)) \cdot |U_{i+1}| = O\left(\frac{\alpha(n)}{\epsilon}\right) \cdot |U_{i+1}|$

for any $\alpha(x) \ge 1/x$, this means that every node in $U_{i+1}$
is charged with a value of $O(\epsilon^{-1} \cdot \alpha(n))$. Every
node can be charged at most $\log(1/\epsilon)$ times because each
time a node is charged, it ends up in a node set $U_{i+1}$ that is at
most half as large as $V_i$, and we stop splitting a node set once it
is of size less than $\epsilon n$. Hence, at the end, every node in
$V$ is charged with a value of
$O(\frac{\log(1/\epsilon)}{\epsilon} \cdot \alpha(n))$. Summing up
over all nodes, the total charge is

    $O\left(\frac{\log(1/\epsilon)}{\epsilon} \cdot \alpha(n) \cdot n\right)$

which represents the number of nodes that have been removed from
the graph. □



3. RANDOM FAULTS 

We now direct our attention to the case of random faults. We
assume that each node in the network can independently become
faulty with a given probability $p$.

3.1 Random faults aren't (always) easier to 
handle 

Intuitively it appears that in general this situation might be easier
to handle since there is no malicious adversarial intent behind the
distribution of node failures. But in general this does not seem to
be true. We begin this section by showing that there are families
of graphs for which a fault probability of $O(\alpha)$ causes the
graph to disintegrate into sublinear fragments, where $\alpha$ is the
node expansion of the graph. In other words, in these graphs
$\Theta(\alpha n)$ random node failures can be catastrophic: they
don't even allow us to find a linear-sized connected component, hence
making it impossible to find a linear-sized connected component with
good expansion.

To construct this family of graphs we begin with an infinite
family of constant-degree expander graphs with a constant node
expansion $\beta$ and maximum degree $\delta$. We denote this family
as $\mathcal{G}(n)$.

Theorem 3.1. Given any $\alpha < \beta$, there exists an infinite family
of graphs with node expansion $\alpha$ for which a fault probability of
^ ^ ■ a causes the graph to disintegrate. 

Proof. We use the family of graphs constructed in the proof of
Theorem 2.3, i.e. let $\mathcal{G}(n)$ be an infinite family of
constant-degree expander graphs with constant expansion $\beta$ and
degree $\delta$. Construct a graph, $H$, which is a copy of $G$ with
each edge replaced by a chain of $k$ nodes. Graph $H$ has
$O(k \cdot n)$ nodes. From Claim 2.4 we know that $H$ has expansion
$O(\frac{1}{k})$. Exercise 5.7 of [23] gives us the following
important property of $H$:

Claim 3.2. The number of connected subgraphs of $H$ with $r$
vertices from $G$ in them is at most $n \cdot \delta^{2r}$.



Proof. Any connected subgraph of size r can be spanned by a 
tree with r — 1 edges. This tree can be traversed by an Eulerian 
tour in which each edge is used at most twice. Hence the subgraph 
is represented by a walk along the graph of length at most 2r ver- 
tices from G. Since the root can be one of n vertices, the result 
follows. □ 

Let the failure probability of the nodes in H he p = . Con- 
sider any subgraph of H with r = In n vertices from G. The total 
number of nodes in this subgraph is at most S ■ k ■ r and at least 
k ■ r. Hence, this particular subgraph survives in H with probabil- 
ity at most (1 — p)* ' < 6^*° ' By Claim lT2l there are no more 
that n ■ S'^^ such components in H. Hence, the probability that such 
a subgraph survives is at most n ■ S^^ ■ e~'''^'^ — n^~^'"* < i. 
Since with high probability there can be no connected subgraph 
with size Q(5 ■ klnn) in H which has k ■ n vertices and 5 is a con- 
stant, we conclude that H breaks down into sublinear components 
with high probability. 

In the above construction, set fc = [^] for a given a < (3 and 
the theorem follows. □ 

However it isn't as if the expansion of the graph is a critical point 
for all graphs. There are several important classes of graphs which 
can sustain a much higher fault probability and still yield a linear 
sized connected component with good expansion. 



3.2 Extracting a subnetwork of size $\Theta(n)$ and
edge expansion $\Theta(\alpha_e)$

We are given a network $G = (V, E)$ with $n$ nodes, edge expansion
$\alpha_e$ and graph span $\sigma$. Let us call the faulty version of
this network $G_f$. We want to find a network $H \subseteq G_f$ of
size $\Theta(n)$ with edge expansion $\Theta(\alpha_e)$. Let
$\mathcal{U}$ be the set of all compact sets of $G$. Note that a set
is compact if both it and its complement are connected. We will use
the notion of edge expansion in this section.

Lemma 3.3. If $S \subseteq G$ is connected and $|S| \le n/2$ then
there exists a compact set $K_G(S)$ in $G$ whose edge expansion is no
more than $S$'s edge expansion.

Proof. If $S \in \mathcal{U}$ then $K_G(S)$ is simply $S$. If
$S \notin \mathcal{U}$, $G \setminus S$ is not connected. Let
$\mathcal{C}(S)$ be the set of maximal connected subgraphs of
$G \setminus S$. Let $\Gamma_e(\cdot)$ be the set of edges leaving a
set. It is clear that $\mathcal{C}(S) \subseteq \mathcal{U}$ (if not
then they are not maximal). We consider two cases.

Case 1: There is a $C \in \mathcal{C}(S)$ with $|C| > n/2$.

Then $G \setminus C \in \mathcal{U}$, $S \subseteq G \setminus C$,
$|G \setminus C| < n/2$, and
$\Gamma_e(G \setminus C) \subseteq \Gamma_e(S)$. Hence,
$G \setminus C$ has an edge expansion no more than $S$'s edge
expansion. So, $K_G(S) = G \setminus C$.

Case 2: For all $C \in \mathcal{C}(S)$, $|C| \le n/2$.

If any of the connected components in $\mathcal{C}(S)$ has an edge
expansion no more than $S$'s then let that component be $K_G(S)$. If
not, then all components $C_i \in \mathcal{C}(S)$ have an edge
expansion strictly larger than $S$'s, i.e. for all $i$,
$\frac{|\Gamma_e(C_i)|}{|C_i|} > \frac{|\Gamma_e(S)|}{|S|}$. But
$\Gamma_e(\bigcup_i C_i) = \Gamma_e(S)$. Hence,
$|S| > |G \setminus S|$, which is a contradiction. Therefore, one of
the $C_i$'s must have an edge expansion less than or equal to $S$'s
edge expansion. □
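
The case analysis in this proof is constructive, so it can be turned into a short routine. The following is a minimal sketch under stated assumptions: the graph is connected and given as an adjacency dictionary `adj`, $S$ is connected with $|S| \le n/2$, and the names `components`, `edge_expansion_of`, and `compact_set` are hypothetical.

```python
def components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in nodes and y not in seen:
                    seen.add(y)
                    comp.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps


def edge_expansion_of(S, adj):
    """|Gamma_e(S)| / |S| for a single node set S."""
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    return cut / len(S)


def compact_set(adj, S):
    """Return a compact set K_G(S) following the cases of Lemma 3.3."""
    V, S = set(adj), set(S)
    comps = components(V - S, adj)
    if len(comps) <= 1:
        return S  # S is already compact
    for C in comps:
        if len(C) > len(V) / 2:
            return V - C  # Case 1: complement of the big component
    # Case 2: a component of smallest edge expansion does the job.
    return min(comps, key=lambda C: edge_expansion_of(C, adj))


if __name__ == "__main__":
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
    print(compact_set(path, {1}))
```

In the usage example the set {1} splits the 7-node path into {0} and {2, ..., 6}; the second piece is larger than $n/2$, so Case 1 applies and the routine returns {0, 1}.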



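Algorithm Prune2($\epsilon$)
1: $G_0 \leftarrow G_f$; $i \leftarrow 0$
2: while $\exists (S_i, G_i \setminus S_i)$ in $G_i$ s.t. $|(S_i, G_i \setminus S_i)| \le \alpha_e \cdot \epsilon \cdot |S_i|$
   and $|S_i| \le |G_i|/2$ and $S_i$ is connected
3:     $K_i \leftarrow K_{G_i}(S_i)$
4:     $G_{i+1} \leftarrow G_i \setminus K_i$
5:     $i \leftarrow i + 1$
6: end while
7: $H \leftarrow G_i$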

Figure 2: The pruning algorithm 

We use notation from algorithm Prune2 in the proof and
statement of Theorem 3.4.

Theorem 3.4. Prune2($\epsilon$) returns a subnetwork $H$ of size $|H| \ge$
$n/2$ with edge expansion $\epsilon \cdot \alpha_e$ with high probability, provided that

edge expansion, Ue > — — " , fault probability, p < ^^ ^4„ and 
degradation in expansion, e < 

Proof. Let $T = G_f \setminus H$. Hence $T$ is the union of all the
culled regions. To prove the result we will show that with high
probability the size of $T$ is not more than $n/2$. Let
$\{T_1, T_2, \ldots, T_l\}$ be the maximal connected components of $T$.

Claim 3.5. $\forall T_i \in T$, $T_i$ is compact in $G_f$.



Proof. Suppose $T_i$ is not compact in $G_f$. Select the largest $j$
such that $T_i$ is not compact in $G_j$ and $T_i \subseteq G_j$ (i.e.
no part of $T_i$ has been culled yet, which means that $G_{j+1}$ is
well-defined). Let us consider two cases:
Case 1: $T_i \subseteq G_{j+1}$

This means that $T_i$ must be compact in $G_{j+1}$, else $j$ could
have been one higher. So, we have 3 components in $G_j$, namely:
$K_j$, $T_i$ and $G_{j+1} \setminus T_i$. Since $T_i$ is noncompact in
$G_j$, the neighborhood of $K_j$ in $G_j$ is wholly in $T_i$. Since
$K_j$ is disjoint with $T_i$, $T_i$ is not maximal. Contradiction.
Case 2: $T_i \not\subseteq G_{j+1}$

This means that $T_i$ and $K_j$ are not disjoint. Since $K_j$ is a
culled set it must be wholly inside $T_i$, else $T_i$ is not maximal.
$T_i$ is not compact in $G_j$, so $T_i \setminus K_j$ is not compact
in $G_{j+1}$. We know that $T_i \setminus K_j$ will not be in $H$.
Hence, all but one connected component (the one that contains $H$) in
$G_{j+1} \setminus T_i$ must belong to $T$. Hence $T_i$ is not
maximal. Contradiction. □

Let $\Gamma(\cdot)$ and $\Gamma^f(\cdot)$ denote the node
neighbourhoods in the faultless graph and the faulty graph
respectively. It is easy to see the following inequalities:
$|\Gamma(T_i)| \ge \frac{\alpha_e |T_i|}{\delta}$ and
$|\Gamma^f(T_i)| \le \alpha_e \epsilon |T_i|$.

These two inequalities imply that
$|\Gamma^f(T_i)| \le \epsilon \delta |\Gamma(T_i)|$. Note that any set
$T_i$ was culled by Prune2 because its edge neighbourhood fell by a
factor of more than $\epsilon$.

The probability that the neighbourhood of some connected set $T_i$
in the faulty graph went down from $\Gamma(T_i)$ to $\Gamma^f(T_i)$
is (for the sake of brevity,
$\Delta := |\Gamma(T_i)| - |\Gamma^f(T_i)|$):



a > 1. Hence, 

n 

Pi[3T, |r(roi > fc] < ^ n ■ 5^"-' ■ 5- 



^ 2 c-fe ^ 1 

< n ■ b < — 



Case 2: Vi, |r(rO| < k. 



Pr[r, is culled] < < ^-3 

The $T_i$s are disjoint by definition. Some $T_i$ and $T_j$ might
share a bad node in their neighbourhood, leading to a dependency
between them. But we do know that since the perimeter of each $T_i$
is at most $k - 1$, the maximum degree of the dependency graph
between the $T_i$s is $\delta \cdot (k - 1)$. Hence the dependency
graph can be coloured with
$\delta \cdot (k - 1) + 1 \le \delta \cdot k$ colours.

We know that $|\bigcup_{i=1}^{l} T_i| \ge n/2$. Hence there has to be
a colour class in the colouring of the dependency graph, let us call
it $C$, such that the $T_i$s in that colour class contain at least
$\frac{n}{2 \delta k}$ nodes.

\Ti\ < —■ Hence, the number of distinct T^s in C has to be at 



other. We set a bound on such that this probability becomes 
small. LetOe > 2*ii^. 

Pr[yTi : Tiis bad] < Pr[VTi e C : T is bad] 



< S 5^^^ < 5 



-k/3 



[\Tf{Tiy ' '^ - 



Pr[nodes pruned > n/2] < Pr[Case 1] + Pr[Case 2] < 



ep 



l-eS 



(l-e5)lr(T.)| 



(2) 



Note that this is valid under the condition that
$ep + \epsilon\delta < 1$. It turns out that we have flexibility in
bounding these two terms. We want to set $\epsilon\delta$ closest to
1 so that degradation in expansion is minimal. Therefore, if the
following inequalities hold:



eS < —, ep < }. 
- 2 ' ^ - 25*° 



then the probability that T is culled by prune2 is at most 5 ^ct | r (T; ) | 
(this is an upperbound on the RHS in|2}. 

Vr[T is culled] < 

We enumerate two cases on the size of the neighbourhood of the
$T_i$s. In case 1 we argue that with high probability there is no
$T_i$ with a large neighbourhood. In case 2 we show that if all
$T_i$s have small neighbourhoods then with high probability
$\sum_i |T_i|$ is not more than $n/2$. So, in case 2 assume that
$|\bigcup_{i=1}^{l} T_i| \ge n/2$. Let $k = 3 \log_\delta n$ in the
following cases:
Case 1: $\exists i$, $|\Gamma(T_i)| \ge k$.

We know from before that the probability that a given compact sub- 
graph Ti is culled is at most J"^''''"'"^'" . We multiply this proba- 
bility with the number of ways of choosing such a subgraph. This 
gives us the probability that there is a T with such a large neigh- 
bourhood. Each compact subgraph has its corresponding perime- 
ter. Therefore, the number of compact subgraphs with boundary 
|r(Ti)| is at most the number of a ■ |r(ri)| sized spanning trees in 
the graph. This is at most n ■ J^'^ ''"^"^'". Note that by definition. 



□ 



3.3 Span of the mesh 

Theorem 3.6. The d-dimensional mesh has span 2. 



Proof. Consider a compact set $S$ in the $d$-dimensional mesh
$M$. Let $B$ be the boundary nodes $\Gamma(S)$. We place virtual edges
between nodes in $B$. Two distinct nodes $u = (u_0, \ldots, u_{d-1})$ and
$v = (v_0, \ldots, v_{d-1})$ have a virtual edge between them if
$|v_i - u_i| = 0$ for at least $d - 2$ of the dimensions $i$ and
$|v_i - u_i| \le 1$ for the rest.
Call the set of such virtual edges $E_v$. In Lemma 3.7, stated below,
we claim that the graph $(B, E_v)$ is connected. Therefore, we can
find a spanning tree for $B$ which has exactly $|B| - 1$ virtual edges.
Since each edge in $E_v$ can be simulated by at most 2 edges of $M$,
there is a tree in $M$ spanning the nodes of $B$
with at most $2 \cdot (|B| - 1)$ edges. □
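
The construction in this proof can be checked on small examples with the following sketch. It rests on two assumptions: that $\Gamma(S)$ denotes the nodes of $S$ having a mesh neighbour outside $S$ (the reading used in the proof of Lemma 3.7), and that a virtual edge joins two distinct nodes differing by at most 1 in at most two coordinates. All function names are hypothetical, and the compactness of the example set is assumed rather than verified by the code.

```python
from collections import deque

def mesh_neighbours(v):
    """Yield the 2d lattice neighbours of the point v in Z^d."""
    for i in range(len(v)):
        for step in (-1, 1):
            yield v[:i] + (v[i] + step,) + v[i + 1:]

def boundary_nodes(S):
    """Nodes of S with at least one mesh neighbour outside S
    (assumed meaning of Gamma(S))."""
    return {v for v in S if any(w not in S for w in mesh_neighbours(v))}

def mesh_edge(u, v):
    """True if u and v are adjacent in the mesh."""
    return sum(abs(a - b) for a, b in zip(u, v)) == 1

def virtual_edge(u, v):
    """True if u != v, they agree in at least d - 2 coordinates,
    and they differ by at most 1 in the remaining ones."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    return u != v and max(diffs) <= 1 and sum(1 for x in diffs if x != 0) <= 2

def is_connected(nodes, adjacent):
    """Breadth-first search over `nodes` using the predicate `adjacent`."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in nodes - seen:
            if adjacent(u, w):
                seen.add(w)
                queue.append(w)
    return seen == nodes

# A plus-shaped set in Z^2: both it and its complement are connected.
plus = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}
B = boundary_nodes(plus)
```

Here the four arms form the boundary. They are not mesh-adjacent to one another, yet consecutive arms differ by 1 in two coordinates, so the boundary is connected under virtual edges, and each such edge corresponds to a two-edge mesh path through a common neighbour.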

Lemma 3.7. Let $S \subset \mathbb{Z}^d$ be a finite compact set, let $B$ be the
boundary nodes $\Gamma(S)$, and let $E_v$ be the set of virtual edges. Then
the graph $(B, E_v)$ is connected.

Proof. We will show that for any two points $u$ and $v$ in $B$,
there is a path in $E_v$ connecting the two; if this can be done for
every two points, then $B$ is connected, as we hope to prove.

Our proof uses some basic and standard homology theory of cell
complexes, which can be found in any introductory topology text;
for instance, see [13]. Specifically, we use the $\mathbb{Z}_2$ homology of
$d$-dimensional Euclidean space $\mathbb{R}^d$. We partition $\mathbb{R}^d$ into a complex
of unit hypercube cells having the points of $\mathbb{Z}^d$ as their vertices.
Each $d$-dimensional unit hypercube cell has as its boundary a set
of $2d$ $(d-1)$-dimensional unit hypercube facets, again having points of
$\mathbb{Z}^d$ as vertices, and so on. In this complex, a $k$-chain is defined to
be any finite set of $k$-dimensional unit hypercubes having points
of $\mathbb{Z}^d$ as vertices. The boundary of a $k$-chain $C$ is the symmetric
difference of the boundaries of its hypercubes; that is, it is the set
of $(k-1)$-dimensional hypercubes that are on the boundary of
an odd number of the $k$-dimensional hypercubes in $C$. A $k$-cycle
is defined to be a $k$-chain that has an empty boundary, and a
$k$-boundary is defined to be a $k$-chain that is the boundary of some
$(k+1)$-chain. For quite general classes of cell complexes in more
complicated topological spaces than $\mathbb{R}^d$, every $k$-boundary is a
$k$-cycle, but in $\mathbb{R}^d$ the reverse is also known to be true: every $k$-cycle
is a $k$-boundary.
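
As a small illustration of these definitions, the sketch below computes $\mathbb{Z}_2$ boundaries in the planar cubical complex. The encoding is an assumption made for the example: a unit square is named by its lower-left integer corner, an edge is a two-element `frozenset` of endpoints, and the function names are hypothetical. It shows that the boundary of a 2-chain is the symmetric difference of the square boundaries, and that this boundary is itself a 1-cycle.

```python
def square_edges(x, y):
    """The four unit edges bounding the square with lower-left corner (x, y)."""
    a, b, c, d = (x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1)
    return [frozenset(e) for e in ((a, b), (b, c), (c, d), (d, a))]

def boundary_of_2chain(squares):
    """Edges lying on an odd number of the given squares."""
    result = set()
    for (x, y) in squares:
        for e in square_edges(x, y):
            result ^= {e}  # symmetric difference: toggle membership
    return result

def boundary_of_1chain(edges):
    """Vertices touched by an odd number of the given edges."""
    result = set()
    for e in edges:
        for v in e:
            result ^= {v}
    return result

# Two adjacent unit squares; their shared edge cancels modulo 2.
chain = {(0, 0), (1, 0)}
edges = boundary_of_2chain(chain)
```

The resulting 1-chain is the outline of the 2-by-1 rectangle: six edges, each vertex touched exactly twice, hence an empty boundary.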

Now, given $u$ and $v$, since $S$ is connected we can find a path $p_1$
connecting $u$ to $v$ by a sequence of adjacent points in $S$. We also
find an edge $e_1$ connecting $u$ to an adjacent point of $\mathbb{Z}^d$ outside
$S$, an edge $e_2$ connecting $v$ to an adjacent point of $\mathbb{Z}^d$ outside $S$,
and a path $p_2$ connecting these two exterior points by a sequence
of adjacent points outside $S$ (since the complement of $S$ is
connected). The union of $p_1$, $p_2$, and $\{e_1, e_2\}$ forms a 1-chain in the
cubical complex described above. Moreover, this is a 1-cycle,
because it has degree two at every vertex it touches. Therefore, it
is the boundary of a 2-chain $C$; that is, $C$ is a set of squares and
$p_1 \cup p_2 \cup \{e_1, e_2\}$ is the set of edges in the cubical complex that
touch odd numbers of squares in $C$.

Next, let $U$ be the subset of $\mathbb{R}^d$ formed by a union of axis-aligned
unit hypercubes, one for each member of $S$, and having that
member as its centroid; note that these hypercubes do not have integer
vertices. Let $\mathcal{B}$ be the boundary facets of $U$; $\mathcal{B}$ consists of a
collection of $(d-1)$-dimensional unit hypercubes that again do not have
integer vertices. Finally, let $G = \mathcal{B} \cap C$.

Whenever a square $s$ of $C$ and a $(d-1)$-dimensional hypercube $h$
of $\mathcal{B}$ meet, they do so in a line segment of length 1/2 that connects
the centroid of $h$ (where it is crossed by one edge of the square) to
the centroid of one of its boundary $(d-2)$-dimensional hypercubes.
Thus $G$, the union of these line segments, can be viewed as a graph
that connects vertices at these points. The degree of a vertex at the
centroid of $h$ is equal to the number of squares of $C$ that touch that
point, and the degree of the other vertices can only be two or four,
depending on which of the four vertices of the square defining the
vertex are interior to $U$.

Since the boundary of $C$ crosses $\mathcal{B}$ only on the two edges $e_1$
and $e_2$, these two crossing points have odd degree and all the other
vertices of $G$ have even degree. Any connected component of any
graph must have an even number of odd-degree vertices, so the two
odd vertices $e_1 \cap \mathcal{B}$ and $e_2 \cap \mathcal{B}$ must belong to the same component
and can be connected by a path $p_3$ in $G$.
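
The parity fact used here (every connected component has an even number of odd-degree vertices) can be checked mechanically. The following is a minimal sketch on a made-up graph, not the graph $G$ of the proof; the adjacency-list encoding and the function names are assumptions for illustration.

```python
from collections import deque

def components(adj):
    """Connected components of an undirected graph given as adjacency lists."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        comps.append(comp)
    return comps

def odd_degree(adj, comp):
    """Vertices of the component `comp` that have odd degree."""
    return {v for v in comp if len(adj[v]) % 2 == 1}

# Hypothetical example: a path a-b-c and a disjoint triangle x-y-z.
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
counts = [len(odd_degree(adj, c)) for c in components(adj)]
```

If a graph has exactly two odd-degree vertices, as in the argument above, they cannot sit in different components, since each component would then contain exactly one.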

Each length-1/2 segment of $p_3$ belongs to the boundary of a
single hypercube in $U$, which has as its centroid a point of $B$. Let
$p_4$ be the sequence of centroids corresponding to the sequence of
edges in $p_3$. Then $p_4$ starts at $u$ and ends at $v$. Further, at each
step from one edge in $p_3$ to the next, either the current point in $B$
does not change, or it changes from one point in $B$ to an adjacent point
(when the corresponding pair of edges in $p_3$ form a 180° angle on
two adjoining hypercubes), or it changes from one point in $B$ to
a point at distance $\sqrt{2}$ away (when the corresponding edges in $p_3$
form a 270° angle across a concavity on the boundary of $U$).

So, we have constructed a path between an arbitrarily chosen
pair of points $u, v$ in $B$, and therefore the graph $(B, E_v)$ is
connected. □

Theorem 3.6 implies that the $d$-dimensional mesh can sustain a
fault probability inversely polynomial in $d$ and still have a large
component whose expansion is no more than a factor of $d$ worse
than the original.

4. CONCLUSION 

In this paper we presented a general technique for determining
the robustness of the expansion of different networks, both for
adversarial and random faults. For random faults we have come
up with a new parameter, the span, which allows us to prove a
strong result regarding the robustness of high dimensional meshes.
Among other things, this result can provide useful insights into the
working of peer-to-peer networks like CAN [25], which behaves
like a $d$-dimensional mesh in its steady state. Basically we have
shown that CAN can tolerate a fault probability which is inversely
polynomial in its dimension without losing too much in its
expansion properties.

For the 2-dimensional mesh our result is related to the line of
research followed by Raghavan [24], Kaklamanis et al. [14] and
Mathies [22], who show that despite a constant fault probability
(as high as 0.4) a mesh with random failures can emulate a fault-free
mesh using paths with stretch factor at most $O(\log n)$. Since the
distance between nodes in a graph of expansion $\alpha$ is $O(\alpha^{-1} \log n)$ [20],
our technique gives essentially the same result, albeit with a lower
fault probability. Additionally, for meshes of constant dimension
greater than 2 our results imply an $O(\log n)$ dilation for path lengths,
and hence a way to generalize these earlier results to higher
dimensions.

The strength of our technique is that it is able to yield results for 
the 2-dimensional mesh which are comparable to previous results 
while giving new results for higher dimensional meshes and pro- 
viding a general method suitable for analyzing any network whose 
span can be estimated. 

Open problems 

We conjecture that the butterfly, shuffle-exchange, and de Bruijn
networks all have a span of $O(1)$, which means that they can
tolerate a constant fault probability. Though the span may provide
tight results for these networks, the exponential dependency of the
fault probability on the span does not really give useful results if
the span is beyond $\log n$. Hence, either a better dependency result
is needed or a parameter better than the span is needed. Clearly,
as mentioned in the introduction, having a parameter that can
accurately describe the fault tolerance of graphs w.r.t. expansion under
random faults would be very useful for many applications.